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An MDL framework for sparse coding and 

dictionary learning 

Ignacio Ramirez and Guillermo Sapiro 
Abstract 

The power of sparse signal modeling with learned over-complete dictionaries has been demonstrated in a variety 
of applications and fields, from signal processing to statistical inference and machine learning. However, the statistical 
properties of these models, such as under-fitting or over-fitting given sets of data, are still not well characterized 
in the literature. As a result, the success of sparse modeling depends on hand-tuning critical parameters for each 
data and application. This work aims at addressing this by providing a practical and objective characterization of 
sparse models by means of the Minimum Description Length (MDL) principle - a well established information- 
theoretic approach to model selection in statistical inference. The resulting framework derives a family of efficient 
sparse coding and dictionary learning algorithms which, by virtue of the MDL principle, are completely parameter 
free. Furthermore, such framework allows to incorporate additional prior information to existing models, such as 
Markovian dependencies, or to define completely new problem formulations, including in the matrix analysis area, 
in a natural way. These virtues will be demonstrated with parameter-free algorithms for the classic image denoising 
and classification problems, and for low-rank matrix recovery in video applications. 

Index Terms 

Sparse coding, dictionary learning, MDL, denoising, classification, low-rank matrix completion. 

L Introduction 

A sparse model is one in which signals of a given type y G can be represented accurately as sparse linear 
combinations of the columns (atoms) of a learned dictionary D G W^^^, y = Da + e, where by accurate we mean 
that ||e|| <C ||y|| (in some norm), and by sparse we mean that the number of non-zero elements in a, denoted by 
||a||Q, is small compared to its dimension p. These concepts will be formalized in the next section. 

Such models, especially when D is learned from training samples, are by now a well established tool in a variety 
of fields and applications, see [1], [2], [ ] for recent reviews. 

When sparsity is a modeling device and not an hypothesis about the nature of the analyzed signals, parameters 
such as the desired sparsity in the solutions, or the size p of the dictionaries to be learned, play a critical role 
in the effectiveness of sparse models for the data and tasks at hand. However, lacking theoretical guidelines for 
such parameters, published applications based on learned sparse models often rely on either cross-validation or 
ad-hoc methods for determining such critical parameters (an exception for example being the Bayesian approach, 
e.g., [ ]). Clearly, such techniques can be impractical and/or ineffective in many cases. This in turn hinders the 
further application of such models to new types of data and applications, or their evolution into different, possibly 
more sophisticated, models. 

At the bottom of the aforementioned problem lie fundamental questions such as: How rich or complex is a sparse 
model? How does this depend on the required sparsity of the solutions, or the size of the dictionaries? What is the 
best model for a given data class and a given task? 

The general problem of answering such questions and, in particular, the latter, is known as model selection. 
Popular model selection techniques such as Akaike's Information Criterion (AIC) [ ], Bayes Information Criterion 
(BIC) [ ], and the Minimum Description Length principle (MDL) [7], [8], [9], work by building a cost function 
which balances a measure of goodness of fit with one of model complexity, and search for a model that minimizes 
such cost. In this sense, these tools can be regarded as practical implementations of the Occam's razor principle, 
which states that, given two (equally accurate) descriptions for a given phenomenon, the simpler one is usually the 
best. 
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In the Minimum Description Length principle, given a family or model class M of candidate models indexed by 
a parameter M, and a data sample y, the best model M G is the one that can be used to describe y completely 
(including the parameters M themselves) with the fewest number of bits, 

M arg min L(y, M), (1) 

MeM 

where L(y, M) is a codelength assignment function which defines the theoretical codelength required to describe 
(y, M) uniquely, and which is a key component of any MDL-based framework. The underlying idea of MDL is 
that compressibility is a good indirect way of measuring the ability of a model to capture regularity from the data. 
Common practice in MDL uses the Ideal Shannon Codelength Assignment [ , Chapter 5] to define L(y,M) in 
terms of a probability assignment P(y, M) as L(y, M) = — logP(y, M) (all logarithms will be assumed on base 
2 hereafter). In this way, the problem of choosing L( ) becomes one of choosing a suitable probability model for 
(y, M). Note here how MDL considers probability models not as a statement about the true nature of the data, 
but only as a modeling tool. If we now write P(y,M) = P(y|M)P(M), we obtain the more familiar penalized 
likelihood form, 

M arg min -logP(y|M) - logP(M), (2) 

with — logP(M) representing the model complexity, or model cost, term. 

The use of MDL for sparse signal modeling has been explored for example in the context of wavelet-based 
denoising (where M = a G R^, p = m and D G W^^^ is fixed) of images corrupted by additive white Gaussian 
noise (AWGN) [ 1], [1^], [ ], [14], [15]. In [11], [12], [13], the data is described using (2) with -logP(y|a) 
assumed to be solely due to noise, and an L(a) term which exploits sparsity, 

1 2 

a = arg min — ^ ||y - Da||2 + L(a). (3) 

Here the first term corresponds to the ideal codelength, up to a constant, of an IID Gaussian sequence of zero 
mean and known variance a^. The difference between [ ],['"*], [^ ] lies in the definition of L(a). The line of 
work [14], [1 ] follows the modern MDL approach by using sophisticated tools from coding theory, the so called 
one-part universal codes, which encodes (y, a) jointly, and reduces the arbitrariness in defining L(a). However, 
such tools can only be applied for certain choices of P(y|a) and P(a). In the case of [14], [1j], the choice is 
to use continuous Gaussian models for both. As Gaussian models are not well suited to the typically observed 
statistical properties of such data, the performance of the resulting denoisers for example is very poor compared 
to the current state-of-the-art. 

The present work extends and/or improves on the aforementioned work in the following ways:^ 

• MDL-based sparse coding is extended to the case of non-orthonormal, possibly over-complete and learned 
dictionaries D. As we will see in Section V, this extension, critical to deal with modem, very successful 
sparse modeling approaches, poses not only new design problems but also significant computational challenges 
compared to the orthonormal case. 

• Efficient codelengths (probability distributions) for the different components to encode (error, coefficients, 
dictionary) are obtained by applying universal coding schemes to priors that are suited to the typically observed 
statistics of such data. 

• As a particular point of the above item, systematic model-fit deviations are naturally taken into account in 
P(y|a). The resulting fitting terms fall into the category of robust estimators (see [ ]), thus marrying robust 
statistics with information theory and with sparse modeling (dictionary learning). 

• We comply with the basic MDL sanity check, meaning, that the theoretical codelengths obtained are smaller 
than a ''raw'' description of the data. We do so by including quantization in our models, and treating its effect 
rigorously. 

• Dictionary learning within the MDL framework allows us to optimize both the number of atoms p, as well as 
their structure, resulting in a natural and objective form of regularization for D. 

^This paper extends preliminary results reported in [ ]. In particular, new dictionary learning algorithms are developed which include li 
atom regularization, forward and backward dictionary size adaptation. We also develop a new model for the low-rank matrix approximation 
problem. 
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• Structure is naturally added to the sparse models in the form of Markovian dependencies between adjacent 
data samples. We also show an extension of the model to the problem of low-rank matrix completion. 

As a result of the above features, we obtain for the first time an MDL-based, parameter-free framework for signal 
modeling that is able to yield state-of-the-art results. 

At the theoretical level, this brings us a step closer to the fundamental understanding of learned sparse models 
and brings a different perspective, that of MDL, into the sparse modeling world. 

The remainder of this paper is organized as follows. Sparse models, and the associated notation, are described 
in detail in Section II. Section III introduces MDL, and its application to sparse models. In Section IV we present 
the probability models used to assign codelengths to different parts of the encoded data, while sections V and VI 
describe the actual sparse coding and dictionary learning algorithms developed. Experimental results follow in 
Section VII, and the paper is concluded in Section VIII. 

II. Background on sparse modeling 

Assume we are given n m-dimensional data samples ordered as columns of a matrix Y = [yi|y2| • • • |yn] ^ 
j^mxn Consider a linear model for Y, Y = DA + E, where D = [di|d2| . . . |dp] is an mxp dictionary consisting 
of p atoms, A = [ai|a2| . . . la^^] G R^^^ is a matrix of coefficients where each j-th column aj specifies the linear 
combination of columns of D that approximates y^, and E = [ei|e2| . . . |e^] G W^^^ is a matrix of approximation 
errors. 

We define the support, or active set, of a vector a G as supp(a) = {/c : a^^ ^ 0}. Let V = supp(a). We also 
represent the support of a as a binary vector z G {0, 1}^ such that = 1 for z G F, and otherwise. We refer to 
the sub-vector in RI^I of non-zero elements of a as either ajp] or aj^j. Both conventions are extended to refer to 
sets of columns of matrices, for example, D[r] is the matrix formed by the |r| columns of D indexed by F. We 
will use the pseudo-norm ||a||Q := |r| = to denote the number of non-zero elements of a. We say that the 
model is sparse if we can achieve ||ej||2 <C ||yj II2 ^i^d ||a||Q <C p simultaneously for all or most j = 1, . . . , n. 

The result of quantizing a real- valued variable y to precision 5 is denoted by [y]^. This notation is extended to 
denote element-wise quantization of vector (e.g., [e]) and matrix operands (e.g., [E]). 

A. Sparse coding 

One possible form of expressing the sparse coding problem is given by 

a^- = argmm ||y^--Du||2 s.t. ||u||o < 7, (4) 

where 7 <C p indicates the desired sparsity level of the solution. Since problem (4) is non-convex and NP-hard, 
approximate solutions are sought. This is done either by using greedy methods such as Matching Pursuit (MP) 
[10], or by solving a convex approximation to (4), commonly known as the lasso [ ], 

a.- = arer min - llyo — DulL s.t. ||u||i < r. (5) 

There exists a body of results showing that, under certain conditions on 7 and D, the problem (4) can be solved 
exactly via (5) or MP (see for example [ ], [ ]). In other cases, the objective is not to solve (4), but to guarantee 
some property of the estimated kj. For example, in the above mentioned case of AWGN denoising in the wavelets 
domain, the parameter r can be chosen so that the resulting estimators are universally optimal with respect to some 
class of signals [21]. However, if D is arbitrary, no such choice exists. Also, if D is orthonormal, the problem (5) 
admits a closed form solution obtained via the so-called soft thresholding [ ]. However, again, for general D, no 
such solution exists, and the search for efficient algorithms has been a hot topic recently, e.g., [ j], [^-^]. 

B. Dictionary learning 

When D is an optimization variable, we refer to the resulting problem as dictionary learning: 

1 

(A,D) = argmin^ - ||y^- - Da_^-||2 s.t. ||a_^-||^ < rVj, ||dfc||2 < IVfc, (6) 
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with < r < 1. The constraint ||dj[;^||2 < 1 , A: = 1, . . . is necessary to avoid an arbitrary decrease of the cost 
function by setting D ^ aT>, A ^ ^ A, for any a > 1. The cost function in (6) is non-convex in (A, D), so that 
only local convergence can be guaranteed. This is usually achieved using alternate optimization in D and A. See 
for example [25], [26] and references therein. 

C. Issues with traditional sparse models: a motivating example 

Consider the K-SVD-based [ ] sparse image restoration framework [27]. This is an ^o-based dictionary learning 
framework, which approximates (6) for the case r = by alternate minimization. In the case of image denoising, 
the general procedure can be summarized as follows: 

1) An initial, global dictionary Dq is learned using training samples for the class of data to be processed (in this 
case small patches of natural images). The user must supply a patch width w, a dictionary size p and a value 
for T. 

2) The noisy image is decomposed into overlapping wxw patches (one patch per pixel of the image), and its 
noisy patches are used to further adapt D using the following denoising variant of (6), 

1 

(D, A) = argmin^ ||a_^-||Q , s.t. - \\yj - T>9^j\\l < Ca^ , \^k\2 = 1 , ^ = 1, • • • (7) 

Here the user must further supply a constant C (in [ ], it is 1.32), the noise variance cr^, and the number of 
iterations J of the optimization algorithm, which is usually kept small to avoid over-fitting (the algorithm is 
not allowed to converge). 

3) The final image is constructed by assembling the patches in Y = DA into the corresponding original positions 
of the image. The final pixel value at each location is an average of all the patches to which it belongs, plus 
a small fraction < A < 1 of the original noisy pixels (A = 30/cr in [ ]). 

Despite the good results obtained for natural images, several aspects of this method are not satisfactory: 

• Several parameters {w, p, r, C, J, A) need to be tuned. There is no interpretation, and therefore no justifiable 
choice for these parameters, other than maximizing the empirical performance of the algorithm ( according to 
some metric, in this case PSNR) for the data at hand. 

• The effect of such parameters on the result is shadowed by the effects of later stages of the algorithm and 
their associated parameters (e.g. overlapping patch averaging). There is no fundamental way to optimize each 
stage separately. 

As a partial remedy to the first problem, Bayesian sparse models were developed (e.g., [ ]) where these parameters 
are assigned prior distributions which are then learned from the data. However, this approach still does not provide 
objective means to compare different models (with different priors, for example). Further, the Bayesian technique 
implies having to repeatedly solve possibly costly optimization problems, increasing the computational burden of 
the application. 

As mentioned in the introduction, this work proposes to address the above practical issues, as well as to provide 
a new angle into dictionary learning, by means of the MDL principle for model selection. The details on how this 
is done are the subject of the following sections. 

HI. Sparse model selection and MDL 

Given data Y, a maximum support size 7 and a dictionary size p, traditional sparse modeling provides means 
to estimate the best model M = (A, D) for Y within the set A^(7,p) defined as 

X(7,p) := {(A,D) : ||a,||o < 7,i = 1, • • • , n, D G R'"^^'} . (8) 

We call such set a sparse model class with hyper-parameters (7,^). Such classes are nested in the following way: 
first, for a fixed dictionary size p we have yW(7 — C yW(7,p). Also, for fixed 7, if we consider A^(7,p — 1) 
to be a particular case of A^(7,p) where the p-th atom is all-zeroes and a^j = 0, Vj, then we also have that 
7W(7,P- 1) C 7W(7,P). 

If one wants to choose the best model among all possible classes A^(7,p), the problem becomes one of model 
selection. The general objective of model selection tools is to define an objective criterion for choosing such model. 
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In particular, MDL model selection uses codelength as such criterion. More specifically, this means first computing 
the best model within each family as 

(A(7,p),D(7,p)) = argmin{L(Y,A,D) : (A, D) G A^(7,p)}, 

and then choosing (7,^) = argmin {L(Y, A(7,p), D(7,p)) : < 7 < p > 0}. 

When D is fixed, which is the case of sparse coding, the only model parameter is A, and we have p+1 possible 
classes, M{'y) = {A : ||aj ||q < 7, j = 1, . . . , n}, one for each < 7 < p. If each data sample yj from Y is encoded 
independently, then, as with traditional sparse coding (the framework can also be extended to collaborative models), 
the model selection problem can be broken into n sub-problems, one per sample, by redefining the model class 
accordingly as M{j) = {a : ||a||Q < 7}. Clearly, in the latter case, the optimum 7 can vary from sample to sample. 

Compared to the algorithm in Section II-C, we have a fundamental, intrinsic measure of the quality of each 
model, the codelength L(Y, A,D), to guide our search through the models, and which is unobscured from the 
effect of possible later stages of the application. In contrast, there is no obvious intrinsic measure of quality for 
models learned through (8), making comparisons between models learned for different parameters (patch width w, 
regularization parameter r, norm r, constants C, A) possible only in terms of the observed results of the applications 
where they are embedded. The second advantage of this framework is that it allows to select, in a fundamental 
fashion, the best model parameters automatically, thus resulting in parameter-free algorithms.^ 

Such advantages will be of practical use only if the resulting computational algorithms are not orders of magnitude 
slower than the traditional ones, and efficient algorithms are a critical component of this framework, see Section V. 

A. A brief introduction to MDL 

For clarity of the presentation, in this section we will consider D fixed, and a single data sample y to be encoded. 
The Minimum Description Length principle was pioneered by Rissanen [ ] in what is called "early MDL," and 
later refined by himself [ ] and other authors to form what is today known as "modern MDL" (see [ ] for an 
up-to-date extensive reference on the subject). The goal of MDL is to provide an objective criterion to select the 
model M, out of a family of competing models M, that gives the best description of the given data y. In this case 
of sparse coding with fixed dictionary we have M = a. 

The main idea of MDL is that, the best model for the data at hand is the one that is able to capture more 
regularity from it. The more regularity a model captures, the more succinct the description of the data will be under 
that model (by avoiding redundancy in the description). Therefore, MDL will select the best model as the one that 
produces the shortest (most efficient) description of the data, which in our case is given by L(y,a). 

As mentioned in Section I, MDL translates the problem of choosing a codelength function L( ) to one of 
choosing probability models by means of the ideal Shannon codelength assignment L(y,a) = — logP(y,a). It is 
common to extend such ideal codelength to continuous random variables x with probability density function p(x) 
as L{x) = — logp(x), by assuming that they will be quantized with sufficient precision so that 

P{[x]s) ~ p{x)5, (9) 

and disregarding the constant term — log 5 in L{x), as it is inconsequential for model selection. However, in our 
framework, the optimum quantization levels will often be large enough so that such approximations are no longer 
valid. 

To produce a complete description of the data y, the best model parameters M used to encode y need to be 
included in the description as well. If the only thing we know is that M belongs to a given class M, then the cost 
of this description will depend on how large and complex A4 is. MDL will penalize more those models that come 
from larger (more complex) classes. This is summarized in one of the fundamental results underlying MDL [ ], 
which establishes that the minimum number of bits required for encoding any data vector y using a model from 
a class M has the form L>t(y) = >C>t(y) + C(7W), where CM{y) is called the stochastic complexity, which 
depends only on the particular instance of y being encoded, and C{M) is an unavoidable parametric complexity 
term, which depends solely on the structure, geometry, etc., of the model class M. 

^For the case of image processing, the patch width w is also a relevant parameter that could be automatically learned with the same 
MDL-based framework presented here. However, since it is specific to image processing, and due to space constraints and for clarity of the 
exposition, it will not be considered as part of the model selection problem hereafter. 
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In the initial version of MDL [ ], the parameter M was first encoded separately using L{M) bits, and then y 
was described given M using L{y\M) bits, so that the complete description of y required L(y\M) + L(M) bits. 
This is called a two-parts code. An asymptotic expression of this MDL was developed in [ ] which is equivalent 
to the BIC criterion [ ], only in the asymptotic regime. As we will see next, modern MDL departs significantly 
from this two-parts coding scheme. 

B. Modem MDL and universal coding 

The main difference between "early" [ ] and "modem" [^], [^] MDL is the introduction of universal codes as 
the main building blocks for computing codelengths. In a nutshell, universal coding can be regarded as an extension 
of the original Shannon theory to the case where the probability model P( ) of the data to be encoded is not fully 
specified, but only known to belong to a certain class of candidate probability models M (recall that classic Shannon 
theory assumes that P( ) is perfectly known). For example, M can be a family of parametric distributions indexed 
by some parameter M. Akin to Shannon theory, for an encoding scheme to be called universal, the codelengths it 
produces need to be optimal, in some sense, with respect to the codelengths produced by all the models in M. 

Various universality criteria exist. For example, consider the codelength redundancy of a model Q( ), 7^(y; Q) = 
— log Q(y) — [arg minp^7\4 — log P(y)] . In words, this is the codelength overhead obtained with Q(-) for describing 
an instance y, compared to the best model in M that could be picked for y, with hindsight of y. For example, if M 
is a parametric family, such model is given by the maximum likelihood (ML) estimator of M. A model Q( ) is called 
minimax universal, if it minimizes the worst case redundancy, TZ{Q) 7^(y; Q). One of the main 

techniques in universal coding is one-part coding, where the data y and the best class parameter M are encoded 
jointly. Such codes are used in the line of work of "MDL denoising" due to Rissanen and his collaborators [14], 
[15]. However, applying one-part codes at this level restricts the probability models to be used.^ As a consequence, 
the results obtained with this approach in such works are not competitive with the state-of-the-art. Therefore, in 
this work, we maintain a two-parts encoding scheme (or three parts, if D is to be encoded as well), where we 
separately describe a, D, and y given (a, D). We will however use universal codes to describe each of these parts 
as efficiently as possible. Details on this are given in the next section. 

IV. Encoding scheme 

We now define the models and encoding schemes used to describe each of the parts that comprise a sparse model 
for a data sample y; that is, the dictionary D, the coefficients a, and the approximation error e = y — Da (which 
can include both the noise and the model deviation), which can be regarded as the conditional description of y 
given the model parameters (a, D). The result will be a cost function L(y) of the form (note that y = Da + e can 
be fully recovered from (e, a, D)), 

L(y, a, D) = L(e|a, D) + L(a|D) + L(D). 

While computing each of these parts, three main issues need to be dealt with: 

1) Define appropriate probability models. Here, it is fundamental to incorporate as much prior information as 
possible, so that no cost is paid in learning (and thereby coding) already known statistical features of the data. 
Examples of such prior information include sparsity itself, invariance to certain transformations or symmetries, 
and (Markovian) dependencies between coefficients. 

2) Deal with unknown parameters. We will use universal encoding strategies to encode data efficiently in terms 
of families of probability distributions. 

3) Model the effect of quantization. All components, e, a, D need to be quantized to some precisions, respec- 
tively Se^Sa^Sd, in order to obtain finite, realistic codelengths for describing y (when the precision variable 
is obvious from the argument to which it is applied, we drop it to simplify the notation, for example, we 
will write [e]^^ as [e]). This quantization introduces several complications, such as optimization over discrete 
domains, increase of sparsity by rounding to zero, increase of approximation error, and working with discrete 
probability distributions. 

^In particular, those used in [ ], [ ] are based on the NormaHzed Maximum LikeUhood (NML) universal model [ ], which requires 
closed-form MLE estimators for its evaluation, something that cannot be obtained for example with a Laplacian prior on a and non-orthogonal 
dictionaries. 
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Fig. 1: Encoding of the sparse code, (a) After quantization (here 5^ = 1), each coefficient ak is decomposed into 
three variables, 2:/e = l(a/e), 5jt = sgn(a/e) and 1;^^ = max{|a/e| — 5a, 0}. These are respectively modeled by random 
variables Z ^ Ber(pa), S ^ Ber(l/2), V ^ Exp(0a) (only the shaded numbers are actually encoded) (b) Scheme 
of the mapping from continuous coefficients (random variable A), into Z, S and V. (c) Ideal codelength for the 
MOE model for V, — \ogqy{V] This is a smooth, concave function. 



All such issues need to be considered with efficiency of computation in mind. The discussion will focus first on the 
traditional, single-signal case where each sample y is encoded separately from the rest. At the end of this section, 
we will also discuss the extension of this framework to a multi-signal case, which has several algorithmic and 
modeling advantages over the single-signal case, and which forms the basis for the dictionary learning algorithms 
described later. 

A. Encoding the sparse coefficients 

Probability model: Each coefficient in a is modeled as the product of three (non-independent) random variables, 
A = ZS{V + 5a). where Z = 1 implies A 7^ 0, 5 = sgn(A), and V = max{|A| - 5a, 0} is the absolute value of 
A corrected for the fact that V > 5a when Z ^1.^ 

We model Z as a Bernoulli variable with P(Z = 1) = p^. Conditioned onZ = 0, 5 = 1^ = with probability 
1, so no encoding is needed.^ Conditioned on Z = 1, we assume V{S = —1) = V{S = 1) = 1/2, and V to be 
a (discretized) exponential, Exp(0a)- With these choices, V{SV\Z = 1) is a (discretized) Laplacian distribution, 
which is a standard model for transform (e.g., DCT, Wavelet) coefficients. This encoding scheme is depicted in 
Figure l(a,b). The resulting model is a particular case of the "spike and slab" model used in statistics (see [ ] and 
references therein). A similar factorization of the sparse coefficients is used in the Bayesian framework as well [ ,]. 

Unknown parameters: According to the above model, the resulting encoding scheme for the coefficients (sparse 
code) is a three-parts code: L(a) = L(z) + i^(s|z) + L(a|s,z). The support z is described using the enumerative 
two-parts code [31], which first describes its size, ||a||Q, using \ogp bits, and then the particular arrangement of the 
ones in z using log (,, \ ) bits. The total codelength for coding z is then L(z) = logp+log (,, \ ) . This is a universal 

\||a||Q/ \||a||Q/ 

encoding scheme, and as such is more efficient than those used previously in [11], [13]. Then, L(s|z) = ||a||Q bits 
are needed to encode sj^j, the actual signs of the non-zero coefficients. Finally, we need to encode the magnitudes 
of the ||a||Q non-zero coefficients, ajzj. We do so by considering it first as a sequence of exponentially-distributed 
continuous random variables, to which quantization is applied later. Since the parameter 9a of the exponential is 
unknown,^ we use a universal model qy( ) for the class of continuous exponential distributions instead. We obtain 

"^Note that it is necessary to encode S and V separately, instead of considering S'V as one random variable, so that the sign of A can be 
recovered when \V\ — 0. 

^We can easily extend the proposed model beyond S = V — ^ and consider a distribution for V when Z = 0. This will naturally 
appear as part of the coding cost. This extends standard sparse coding to the case where the non- sparse component of the vector are not 
necessarily zero. 

^This parameter is related to the sparsity level, and as discussed in Section II-C, is usually assumed known or determined via cross- 
validation. Following [ ], here we use tools from universal modeling, which permit to also automatically handle the non-stationarity of this 
parameter and its expected variability for different non-zero entries of a. 
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such universal model qv{V) via a convex mixture, one of the standard techniques for this, 




/• + 00 



where the mixing function r(^;K,/3) = r(/^) ^9'^ ^/3^e is the Gamma density function of (non-informative) 



shape and scale parameters k. and /3. With this choice, (10) has a closed form expression, and the degenerate cases 
9 = and 6> = oc are given zero weight. The resulting Mixture of Exponentials (MOE) density function qv{V), is 
given by (see [ ] for details). 



Note that the universality of this mixture model does not depend on the values of the parameters hia^Pa^ and guided 
by [ 2], we set = 3.0 and = 50. The ideal Shannon codelength for this density function distribution is given 
by -logqy(y; = -logA^^ - /^a log /3a + {i^a + ^)^og{V + 13a)- This function, shown in Figure 1(c), is 

non-convex, however continuous and differentiable for V > 0. 

Quantization: On one hand, quantizing the coefficients to a finite precision 5a increases the approximation/modeling 
error, from y — Da to y — D [a]^ . This additional error, D(a — [a]^ ), will clearly increase with 5^. On the other 
hand, larger 5a will reduce the description length of the non-zero values of the coefficients, [v]^ . In practice, for 
reasonable quantization steps, the error added by such quantization is negligible compared to the approximation 
error. For example, for describing natural images divided in patches of 8x8 pixels, our experiments indicate that 
there is no practical advantage in using a value smaller than 5^ = 16. Consequently, our current algorithms do 
not attempt to optimize the codelength on this parameter, and we have kept this value fixed throughout all the 
experiments of Section VII. 

B. Encoding the error 

Probability model: Most sparse coding frameworks, including all the mentioned MDL-based ones, assume the error 
e to be solely due to measurement noise, typically of the AWGN type. However, e actually contains a significant 
component which is due to a systematic deviation of the model from the clean data. Following this, we model the 
elements of e as samples of an IID random variable E which is the linear combination of two independent variables, 
E = E + N. Here N ^ JV{0,al) represents random measurement noise in y. We assume the noise variance 
known, as it can be easily and reliably estimated from the input data (for example, taking the minimum empirical 
variance over a sufficient number of sub-samples). The distribution of the second variable, E ^ Lap(0, 9e) is a 
Laplacian of unknown parameter 9e, which represents the error component due to the model itself. The resulting 
continuous distribution pe{E), which we call "LG," is the convolution of the distributions of both components (see 
[33] for details on the derivation). 



shown in Figure 2(a) for various parameter values. This function is convex and differentiable on R, which is nice 
for optimization purposes. Figure 2(b) shows its derivative, or so called "influence function" in robust statistics. It 
can be verified that — logp^(£^) behaves like a Laplacian with parameter 9e for large values of E. Further, since 
its derivative is bounded, the influence of outliers is diminished. In fact, — logp^(£^) is easily verified to be a 
^-type M-estimator, a family of functions used in robust statistics (see [. ]). Thus, using this model, we obtain an 
information-theoretic robust estimator, which is consistent with the motivations leading to its use in our framework, 
and which has a significant practical impact in the experimental results. 

Unknown parameters: Since 9e is unknown, encoding e efficiently calls for the use of universal codes. In this 
case, again, we employ a mixture model. Since the parameter 9e comes from the underlying Laplacian component, 
we again use a Gamma for the mixing function. 
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Fig. 2: Residual probability model, (a) Ideal codelength function of the "LG" distribution, — logp^(£^), (b) LG 
influence function, that is, (— logp^(y))^ (c) universal mixture for the LG model (MOEG), (d) MOEG influence 
function. 



We call this model MOEG. As with the MOE model, the universality of this model is guaranteed by the theory for 
the choice of its underlying mixing function, for any (non-informative) and /3e. In this case, we use A^g = 3.0 and 
/3e = 5e. Also, we know from the discussion above that cr^ can be easily and reliably estimated from the data. Thus, 
we can say that the model for E is parameter-free in this case as well. Figure 2(c) shows the numerical evaluation 
of the ideal Shannon codelength — logq^(£^; a^, /^e, /3e), which is non-convex. However, it is twice differentiable 
everywhere, again a desirable property for optimization purposes (more on this in sections V and VI). As with the 
LG distribution, — logq^(£^) is an ^-type M-estimator, in this case, a redescending M-estimator, since its derivative 
(Figure 2(d)) vanishes to at oc. As such, — logq^(£^), derived from the universal model corresponding to p^(£^), 
can reject outliers even more aggressively than — logp^(£^), again marrying robust statistics with information 
theory in a natural way. 

Quantization: To losslessly encode finite-precision input data such as digital images, the quantization step of the 
error coefficients needs not be more than that of the data itself, 5y, and we simply quantize the error coefficients 
uniformly with step 5y. For example, for 8-bit digital images, we set 5e = 5y = L 

C. Model for the dictionary 

Probability model: Dictionary learning practice shows that learned atoms, unsurprisingly, present features that 
are similar to those of the original data. For example, the piecewise smoothness of small image patches is to be 
expected in the atoms of learned dictionaries for such data. This prior information, often neglected in dictionary 
learning algorithms, needs to be taken into account for encoding such atoms efficiently. 

We embody such information in the form of predictability. This is, we will encode an atom d G as a 
sequence of causal prediction residuals, b G R^, 6^+1 = d^+i — di+i(di,d2, . . . , di), 1 < i < m, a function of 
the previously encoded elements in d. In particular, if we restrict d^+i to be a linear function, the residual vector 
can be written as b = Wd, where W G W^^^ is lower triangular due to the causality constraint (this aspect has 
important efficiency consequences in the algorithms to be developed in Section VI). This is depicted in Figure 3, 
along with the specific prediction scheme that we adopted for the image processing examples in Section VII. In 
this case we consider an atom d to be an ^/mx^/rn image patch, and use a causal bi-linear predictor where the 
prediction of each pixel in the dictionary atom is given by north_pixel + west_pixel — northwest_pixel. 

As a general model for linear prediction residuals, we assume b to be a sequence of IID Laplacian samples of 
parameter Od- In principle, 9^ is also unknown. However, describing D is only meaningful for dictionary learning 
purposes, and, in that case, D is updated iteratively, so that when computing an iterate D^^^ of D, we can use 
D(^~^) to estimate and fix 6^ via ML (more on this 6^ later in Section VI). Thus, we consider 6^ to be known. 
Quantization: When A is fixed during a dictionary learning iteration (which consists of an alternate descent 
between D and A), we can view (A, Y) as n input-output training pairs, and D as the ML estimator of the linear 
coefficients describing such mapping via Y = DA + E. Based on this, we use the quantization step 5^ = l/\/n, 
which is an optimal step for encoding the ML parameter in two-part codes, as described in [ , Theorem 1]. 
Computation: Computing L{T>) is only relevant for learning purposes. In general, since ||d/e||2 < 1, and \\dk\\2 < 
^Jm \\dk\\i, we have that 9^ = {pm)~^ J2k ll^^lli — {pV^)~^ <^ = \/^' the error of using the approxima- 
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Fig. 3: Prediction scheme used for learning natural image patches dictionaries (in this example, 3x3 patches, and 
m = 9). An atom dk is arranged as a 3x3 patch, and a causal bi-linear predictor (shown as a 2x2 template) with 
zero-padding (pixels outside of the patch are assumed 0) is applied to it, producing a predicted atom d^^^^ and a 
residual = — dk- The previous operation can be written as b^^^^ = Wd^^^^, with W G R^^^ the linear mapping 
from atom to prediction residuals corresponding to this example. 



tion (9) is not significant, 

P P p 

L(D) = J2 ^(d^) ^ E Pi^dk; 9d) - m log 5a} - 0dJ2 II Wd, || ^ + log n + c, (13) 

k=l k=l k=l 

where p(Wd/e) is the IID Laplacian distribution over the fc— th atom prediction residual vector Wd^^^^, and c is a 
fixed constant. For p fixed (we will later see how to learn the dictionary size p as well), the above expression is 
simply an ii penalty on the atom prediction residual coefficients. As we will see in Section VI, this allows us to 
use efficient convex optimization tools to update the atoms. 

D. Extension to sequential (collaborative) coding 

One natural assumption that we can often make on the set of data samples Y is that, besides all being sparsely 
representable in terms of the learned dictionary D, they share other statistical properties. For example, we can 
assume that the underlying unknown model parameters, 9e, pa, Oa, 0^, are the same for all columns of the sparse 
data decomposition (E, A). 

Under such assumption, if we encode each column of Y sequentially, we can learn statistical information from 
the ones already encoded and apply it to estimate the unknown parameters of the distributions used for encoding 
the following ones. The general idea is depicted in Figure 4(a). Concretely, suppose we have already encoded j — 1 
samples. We can then use [ei, 62, . . . , ^{j-i)] to estimate 9e, and [ai, a2, . . . , ^{j-i)] to estimate 9a and pa, and 
"plug-in" these parameters to encode the j-th sample. This justifies the name of this encoding method, which is 
known in the coding literature as sequential plug-in encoding. This encoding strategy has several advantages: 1) 
For common parameter estimators such as ML, this method can be shown to be universal; 2) Since all distribution 
parameters are fixed (pre-estimated) when encoding the j-th sample, we can use the "original," non-universal 
distributions assumed for modeling ej (LG) and (Laplacian), which have closed forms and are usually faster to 
compute (together with (9)) than their universal mixture counterparts; 3) Furthermore, these original distributions 
are convex, so that in this case, given a fixed support, we are able to exactly minimize the codelength over the 
non-zero coefficient values; 4) With many samples available for parameter estimation, we can potentially afford 
more complex models. 

Residual: We estimate 9e in two steps. First, since the random variable E is an independent sum of two random 
variables, E = E + N, have that var(£^) = var(^) + var(A^) = var(^) + a^. Now, since E is Laplacian, we 

have that var(^) = 29^. Combining both equations we have that 9e = 0.5^var(^) — a^. With the noise variance 

(jg assumed known, and using the standard unbiased variance estimator, var(£') = (p(j — ||E[i ||^, 

we obtain 

Oe = 0.5^max{(p(j - l))-i ||E[i,...,(,_i)] ||^ - a^O}, 
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Fig. 4: Collaborative encoding scheme, (a) In this example, 3 samples have already been encoded, and we are 
about to encode sample 4. The formulas for estimating the various model parameters are shown for j — 4, in 
particular those for the error and the coefficients associated to the fc-th atom (the fc-th row of A), (b) Markov 
model for the coefficients support matrix Z. Here, a sample patch y is about to be encoded. Here the first atom 
was only used by the pixel to the west, so that the Markov state for modeling zi is {n^w^nw) = (0, 1,0), and 
P{zi = 1) = P(oio)- atom, only the nw pixel has used it, so that the Markov state for Zk is 

(0,0,1), that is, p(zk = 1) = p^^^^^^y 



where the maximization guarantees that 9e G M^. 

Coefficients: In the case of a, we have in principle two unknown parameters, the probability of an element being 
non-zero, pa, and the scale parameter of the Laplacian governing the non-zero values, 9a (both previously handled 
with universal models). Here, however, we extend the model, drawing from the well known empirical fact that 
coefficients associated to different atoms can have very different statistics, both in frequency and variance. This 
is typical of DCT coefficients for example (see [ ]), and has been consistently observed for learned dictionaries 
as well [ ]. Therefore, we will consider a separate set of parameters {Pa^O^) for each row k of A, a^. We 
update such parameters from the coefficients observed in the respective row for the already-computed samples, 
{ciki^CLk2^ . . . , CL]^[j-i)), and encode each k-ih coefficient in aj (more specifically, in Zj, and Vj), as the j-th sample 
of the respective row. Concretely, let n\ = Xljti^ ^kj' be the number of non-zero coefficients observed so far in 
the fc-th row. For p^, we use the Krichevsky-Trofimov (KT) estimator [ ], 

p; = =i±^. (.4) 

which is a universal plug-in encoding scheme for Bernoulli sequences of unknown parameter. For encoding v^j, 
we apply the ML estimator for the exponential family to the non-zero coefficients observed so far in the fc-th row. 
Recalling that Vkj = max{|a/ej/| — 5a, 0}, the resulting estimator is given by 

^^j'=i inax{|a/ej'| - Sa, 0} 
^« k • 

Markovian dependencies: In many applications, spatially/temporally adjacent samples are statistically dependent. 
For example, we may assume that an atom is more likely to occur for a sample j if it has been used by, say, the 
(j — l)-th sample (see also [ ]). In that case, we may consider different estimations of p^ depending on the value 
of z^j_i^, pi = = l\z^j_i^ = 1), and p^ = P(^/cj = lk/c(j-i) = 0)- particular, for the image processing 

results of Section VII, we use a Markovian model which depends on three previous samples, corresponding to the 
(causal) neighboring west, north, and northwest patches of the one being encoded. Thus, for each atom k we will 
have 8 possible parameters, p^^^ {n,w,nw) G {0,1}^, where each value of {n,w,nw) indicates a possible 
Markov state in which a sample may occur. This is depicted in Figure 4(b). For each state (n, nw), we estimate 
P\nwnw) ^^ii^g (1^4), with the average taken over the samples which occur in the same state (n^w^nw). 
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Algorithm 1: COdelength Minimizing Pursuit Algorithm (COMPA) 



Input: Data sample y, dictionary D 
Output: a, e 

1 initialize t ^ 0; a^^) ^ 0; e ^ y; L^^) ^ L(y, 0); g^*) ^ D^e^*) ; // g^*) correlation of current residual with the dictionary 

2 repeat 



3 
4 
5 
6 
7 
8 
9 

10 + 1 

11 until L(*) > L(*-i) 

12 a ^ a(7 - 1) ; 

13 e ^ [y - Da] ; 

14 STOP ; 



Afc ^ [5'fc*'*](5a ' ^ "^^^i^ ^fc correlation, quantized to prec. Sa 



for ^ 1, 2, . . . , p do 

Lfc ^ L([e — Afcdfc]^^, a + A^ook) ; ^k = k-th canonical vec. of 

end 

L(*+i) ^ min{Lfc : /c = 1, . . . , p} ; 

^ g^(t) \ // update coefficients vector 

g(t+i) ^ g(t) _ A^d^ ; //update correlation 



V. MDL BASED SPARSE CODING 

For the encoding problem, D is fixed (it has already been learned), and we consider encoding a single data sample 
y. The model selection problem here is that of choosing the model (indexed by the sparse code a) among all the 
models belonging to the nested family of model classes 7W(7) = {a G R^, ||a||Q < 7} ,7 = 0, . . . that yields 
the smallest codelength for describing y. In principle, this calls for finding the best model a(7) within each model 
class 7W(7), and then selecting a = argmino<7<p L(y, a(7)). However, in order to be computationally efficient, 
and as with most sparse coding and model selection algorithms, several simplifications and approximations are 
needed. Let us first consider the problem of finding a(7), 

a(7) := arg min L(y, a) = arg min — logP^(y — Da) — logP(z) — logP(s|z) — logPy(a|s, z) 

= arg min -logP£;(y - Da) - log ( ) + ||a|L - logPy(a) s.t. ||a|L < 7. (15) 
aGRp \||a||Qy 

For quantized a, this is an optimization problem over a discrete, infinite domain, with a non-convex (in the continuous 
domain) constraint, and a non-differentiable cost function in a. Based on the literature on sparse coding, at least 
two alternatives can be devised at this point. One way is to use a pursuit technique, e.g., [ ]. Another option is 
to use a convex relaxation of the codelength function, e.g., [ ]. For the sake of brevity, here we will describe 
an algorithm loosely based on the first alternative. Details on the convex relaxation method for MDL-based sparse 
coding will be published elsewhere. 

The pursuit-like algorithm, which we call COdelength-Minimizing Pursuit Algorithm (COMPA), is summarized 
in Algorithm 1. This is a non-greedy cross-breed between Matching Pursuit (MP) and Forward Stepwise Selection 
(FSS) [ ]. As with those methods, COMPA starts with the empty solution a^^^ = 0, and updates the value of one 
single coefficient at each iteration. Then, given the current correlation g(^) = De^^^ between the dictionary atoms 



9 k 



and the current residual, each fc-th coefficient in a^^^ is tentatively incremented (or decremented) by /S.j^ = 

and a candidate codelength Lk is computed in each case. The coefficient that produces the smallest L(y, a) is 
updated to produce a(^+^\ 

The logic behind this procedure is that the codelength cost of adding a new coefficient to the support is usually 
very high, so that adding a new coefficient only makes sense if its contribution is high enough to produce some 
noticeable effect in the other parts of the codelength. A variant of this algorithm was also implemented where, 
for each candidate fc, the value of the increment A^^ was refined in order to minimize L]^. However, this variant 
turned out to be significantly slower, and the compression gains where below 0.01 bits per sample (uncompressed 
codelength is 8 bits per sample). Assuming that L^^^ is unimodal, the algorithm stops if the codelength of a new 
iterate is larger than the previous one. To assess the validity of this assumption, we also implemented a variant 
which stops, as MP or FSS, when the residual-coefficients correlation ||g^^^||^ is no longer significant, which 
typically requires many more iterations. With this variant we obtained a negligible improvement of 0.004 bits per 
sample, while increasing the computational cost about three times due to the extra iterations required. 
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Fig. 5: Typical evolution of the COMPA algorithm, (a) coefficients, (b) codelength. The best iterate (code) is 
marked with a black circle. Also note that describing the support (L{Z)) actually takes more bits than describing 
the non-zero values (L(V)). 



Algorithm 2: MDL-based dictionary learning via forward selection. 



Input: Data Y 
Output: (A, D) 

1 initialize p ^ 0; A(0) ^ 0; D(0) ^ 0; E(0) ^ Y; L(0) ^ L(E(0), A(0), D(0)) ; 

2 repeat 

3 d ^ ui, UEV" = E(^) // Initial value of new atom is the left-eigenvector associated to the largest singular value of'E^^\ 

4 D*-* ^ [D(p) I d] //Initial dictionary for optimization below. 

5 + 1), D(p + 1)) cirg min(A,D)eA4(p+i) ^(-^5 ^ Optimize diet, via Algorithm 3 

6 p ^ p -\- 1 ; 

7 L(p)^L(E(p),A(p),D(p)) ; 

8 until L(p) > L(p - 1) ; 

9 A ^ A(p - 1); D ^ r>(p - 1); 



VI. MDL BASED DICTIONARY LEARNING 

Given that our sparse coding algorithm in Section V can select the best support size 7 for each sample in Y, the 
definition of the model class M{j^p) given in Section III, which assumes the same 7 for all samples in Y, is no 
longer appropriate (we could of course add 0- weight coefficients to make 7 equal for all data). Instead, for dictionary 
learning, we consider the model class family M{p) = {(A, D), D G W^^p^ aj G M{j; D), j = 1, . . . , n}, where 
A^(7; D) is the model class family of sparse codes based on a fixed dictionary D defined in Section V, with the 
dependency on D made explicit. It is easy to see that the model classes M{p) are nested. We now need to solve 

(A(p),D(p)) = arg min L(E,A,D), (16) 

(A,D)g7W(p) 

for p = 0, 1, . . ., and then choose (A, D) = (A(j3), D(^)) with the optimal dictionary size 

p = argmin {L(E, A(p), D(p)) : p = 0, 1, . . .} . 

p 

As with sparse coding, here we exploit the nested nature of the model classes to speed up the model selection. 
For this, we propose a forward- selection algorithm, described in Algorithm 2, which starts from A4{0) (the empty 
dictionary), and then approximates the best model in A4(p+ 1) by adding a new atom to the dictionary computed 
for M{p) and then invoking Algorithm 3, which is discussed in depth in the next subsection. 

A backward- selection algorithm was also developed which first learns the model for A^(pmax) via (3), where 
Pmax is a given maximum dictionary size, and then prunes the less frequently used atoms until no further decrease 
in codelength is observed. This algorithm allows us to provide especially-constructed initial dictionaries for Algo- 
rithm (3), e.g., an (overcomplete) DCT frame, which can be critical for finding good local minima of the non-convex 
problem (16). We do this for example to learn a dictionary for the whole class of natural images, see Section VII. 
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Algorithm 3: MDL-based dictionary learning for a given size p 



Input: Data Y, initial dictionary D*-*, multiplier A, 77 
Output: Local-optimum (A, D) 

1 initialize D(o) = D^, t = 1 ; 

2 repeat 



for J = 1, . . . , n do 

I a^*^ ^ argminAi-(e,a,D(*-i)) ; 
end 

Update plug-in parameters: ^e, {{^a, Pa)^ k = 1, . . . ,p} , Od', 
D(*) ^ arg minD ^(E, A, D) ; 

until - — ii^mi — - < e ; 



A. Optimizing the dictionary for fixed p 

For fixed p, and given an initial D, Algorithm 3 adapts the atoms of D to fit the training data Y. At the high 
level, our algorithm is very similar to the traditional approach of alternate minimization over (A, D). However, there 
are a number of important differences, namely: 1) The cost function minimized is now the cumulative codelength 
of describing Y, L(E, A,D); 2) Minimizing over A is done sample by sample following Section V; 3) Since D 
needs to be described as well, it has an associated codelength (see Section IV-C), resulting in regularized dictionary 
update, described below; 4) in a cross-breed between Expectation-Maximization, and plug-in estimation, we estimate 
the model parameters for the current iterate (E^^), A^^), D^^^), from the accumulated statistics of previous iterates 
|(E(^OA(^'),D(^')),t' = 1, . . . ,t - l}. At the end of the learning process, these parameters are "saved" as part of 
the learned model and can be used for modeling future data along with D. 

At the t-th iteration of the alternate minimization between D and A, with A^^) just computed and kept fixed, 
the dictionary step consists of solving the sub-problem 

DW = arg min L(Y, A^, D) = arg min L(Y| A^, D) + L(D). 

According to Section IV-C, we have L(D) = ^ ELi WMi^ where of = ^ ELi i is the 

Laplacian MLE of Od based on D^^"^). Correspondingly, the data fitting term, via (9) and disregarding the constant 
terms, is given by L(Y|AW,D) = L(Y - DAW|#,a2) = E;=i EI^i " log i^Gly^, - (DAW),,; #, a^), 

where 9^^ is the estimator of 9e given E = Y — D^^^^^A^^^ (see Section IV-D) and al is assumed known. The 
problem can now be written as, 

p 

D^*) = argminL(Y - DA^I^^^ a^) + of ^ ||Wdfc||i . (17) 
^ k=i 
For general W, the optimization of (17) is challenging since none of the above terms are separable, in particular, 
the non-differentiable li term. However, since W is easily invertible (as described in Section IV-C, it is lower 
triangular with I's in the diagonal), we can perform a change of variables and solve the equivalent problem in the 
prediction residual matrix U — WD instead, 

p 

U = argminL(Y - W-^UA^ a^) + of ^\\^k\\i- (18) 
^ k=i 

Since the regularization term in (18) is decoupled in the elements of U, and L(Y — W~^UA|0e^\ cTg) is convex 
and differentiable in U (see Figure 2(a)), (18) can be efficiently solved using the existing techniques for separable 
non-differentiable regularization terms. In our case, we employ the backtracking variant of FISTA [23], focusing 
on an efficient numerical evaluation of each step. 

VII. Experimental results 

A. Coding performance 

The first experiment in this section assesses the ability of our coding scheme to actually produce a compressed 
description of the data, in this case 8-bit gray-scale images. To this end, a dictionary D was learned using the 
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backward- selection algorithm, for the training samples from the Pascal'OG image database, ^ converted to 8-bit 
gray-scale images and decomposed into 8x8 patches. The initial dictionary was an overcomplete DCT frame with 
p = 256. The resulting global dictionary D has p = 250 atoms. We then encoded the testing samples from the 
same database, obtaining an average codelength of 4.1 bits per pixel (bpp), confirming the ability of our model to 
produce a compressed description of the data. 

B. Learning performance 

We compare the performance of the forward and backward dictionary learning algorithms proposed in Section VI 
by applying each method to learn a dictionary for the standard "Boats" image (taken from the SIPI database, ^ along 
with "Lena," "Barbara" and "Peppers" used in the following experiments), and then measuring the final codelength 
and computation time. For the backward case, the initial dictionary is the global dictionary learned in the previous 
experiment. As for the forward method, we also include a faster "partial update" variant which performs a few 
(10) iterations of Algorithm 3 after adding a new atom, instead of allowing it to converge. The backward method 
produced a dictionary of size p = 170, yielding a compression level of 5.13bpp at a computational cost of 3900s. 
For the convergent forward method, a dictionary with p = 34, yielding 5.19bpp and requiring 800s. Finally, the 
forward method resulted in a dictionary of size p = 20, 5.22bpp, and required 150s. In all cases, the running 
times were measured for a parallelized C-i~i- implementation running on an Athlon Phenom II X6 at 2.6GHz). In 
summary, all three methods reach similar, significant, compression levels. Slightly better results are obtained with 
the backward method, at the cost of a significant increase in computational time. On the other hand, the partial 
forward variant is significantly faster than the other two, yielding similar codelengths. 

C. Denoising of natural images 

The task in this case is to estimate a clean image from an observed noisy version whose pixels are corrupted by 
AWGN of known variance a^. Here Y contains all (overlapping) 8x8 patches from the noisy image. The denoising 
algorithm proceeds in two stages. In the first one, a dictionary D is learned from the noisy image patches Y. We 
use the backward selection algorithm since it allows us to use the global dictionary as the starting point, a common 
practice in this type of problems, [25], [27]. Secondly, the clean patches are estimated as sparse combinations of 
atoms from D. In our case, the second stage admits two variants. The first one is a rate-distortion (RD) procedure 
akin to the traditional method used for example in [ ], where each clean sample yj is estimated using a distortion- 
constrained formulation. In our case, we minimize the codelength (or "rate") of describing up to a prescribed 
distortion proportional to the noise level, yj = Daj,aj = argminuL(u) s.t. ||yj — Du||2 < Ca^. Here we 
use C = 1.0. The second variant, coined "post-thresholding" (PT) is more consistent with the learning phase, and 
is truly parameter-free, since the estimation derives from the same codelength minimization procedure used for 
learning the dictionary D. In this case we obtain an initial estimate fj = Daj, kj = argminuL(u) + L(yj|u). 
However, according to the model developed in Section IV-B, the encoding residual e = y j — yj may contain a 
significant portion of clean data due to modeling errors. We can then think of e as clean data corrupted by noise of 
variance a^. To extract the clean portion, we solve another codelength-minimization sub-problem, this time with a 
Gaussian prior for the error, and a Laplacian prior for the clean part, = argmin^ ^||ej — u||2 + J-||u||^, where 

9e = Y^0.5max{0, var(ej) — a^}, following Section IV-D. We then compute the final estimate as yj = yj + ej. In 
either variant, the model used for L(a) includes the Markovian dependency between the occurrence of each atom 
in a patch and its previously-encoded neighbors, as described in Section IV-D. 

Denoising performance is summarized in Figure 6, along with a detail of the result for (jg = 10 for the "Boats" 
image in Figure 6. In all cases, there is a 1 to 5 dB improvement over the best MDL-based results in [15], thus 
showing the relevance of overcoming the limitations in previous MDL applications to sparse coding. Both the 
RD and PT methods yield results which are comparable to those of [ ], which depend significantly on several 
carefully tuned parameters.^ While the RD variant performs better than PT in terms of PSNR, PT is faster and 

^http://pascallin.ecs.soton.ac.uk/challenges/VOC/databases.html 
^http://sipi.usc.edu/database/database.php?volume=misc&image=38#top 

^To the best of our knowledge, these results, as well as those in [ ], are among the best that can be obtained for gray-scale images 
without using multi-scale and/or spatial aggregation of patches as in [ >], [40]. 
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Fig. 6: Denoising results. Left table: denoising performance, in PSNR, of K-SVD [ ], MDL denoising [15], and 
the Post-Thresholding (PT) and Rate-Distortion (RD) denoising variants. Images, top row: clean "Boats", noisy 
version, learned dictionary for this image (final p = 248), image recovered using RD. Images, bottom row: image 
reconstructed from the initial estimation fj obtained in the PT method, its residual, portion of residual that was 
added back, final PT estimation. 



tends to produce less artifacts than RD, thus resulting in more visually pleasant images than RD. This, which can 
be clearly seen in Figure 6, occurs in all other cases as well. Including the Markov dependency in L(a) produced 
an average improvement of up to 0.2dB. 

D. Texture mosaic segmentation 

Here we are given c images with sample textures, and a target mosaic of textures, and the task is to assign 
each pixel in the mosaic to one of the textures. Again, all images are decomposed into overlapping patches. This 
time a different dictionary is learned for each texture using patches from corresponding training images. In order 
to capture the texture patterns, a patch width w ^ IQ was used. Then, each patch in the mosaic is encoded using 
all available dictionaries, and its center pixel is assigned to the class which produced the shortest description length 
for that patch. 

This seemingly natural procedure results in a success rate of 77%, which is consistent with the second picture of 
Figure 7. The problem is that this procedure is inconsistent with the learning formulation, because each dictionary 
is adapted to minimize the average codelength of describing each patch in the respective texture. Therefore, good 
results can only be expected if the decision is made for groups of patches simultaneously, that is, by considering 
the cumulative codelength of a set of patches. We implement this by deciding on each patch on the basis of 
comparing the average codelength obtained with each dictionary for encoding that patch and all patches in a 
circular neighborhood with a radius of 20 pixels. The success rate in this case is 95.3%, which is comparable 
to the state-of-the art for this type of problems (see for example [ ], which learns sparse models for explicitly 
maximizing the success rate). The Markovian model improved our results by 1%. 

E. Low-rank matrix approximation 

The low-rank matrix approximation family of problems (see [^^] for a review) can be seen as an extension to 
the problem of sparse coding where sparsity is substituted by matrix rank. Concretely, the task is to recover a 
matrix A G W^^^ from an incomplete and/or corrupted observation Y, under the assumption that the rank of A, 
rank(A), is small. As with sparse coding, rank(A) is relaxed using the li equivalent for matrix rank, which is 
the nuclear norm, ||A||^ := Xli'^^l^)' where cFi{A) is the z-th singular value of A. It has been shown in [42] 

^^Taken from http://www.ux.uis.no/~tranden/. 
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Fig. 7: Left to right: Texture mosaic, dictionaries learned for each class (note the automatically learned different 
sizes), patch-wise codelength-based classification map -each shade of gray corresponds to a texture class - (77.0% 
success rate), classification map obtained by averaging the codelength over a neighborhood of patches (95.4% 
success rate). 



that, under certain assumptions on rank(A), the following estimation function is able to recover A from a noisy 
observation Y, and with a significant fraction of its coefficients arbitrarily corrupted, 

A argmin ||W||^ + A||Y — W||;i^, A = l/A/max{m, n}. (19) 
w 

A common proof of concept is to use this framework for robust background estimation in camera surveillance 
video sequences [ ], and we apply our proposed framework for the same application. 

To perform our MDL-based model selection within this formulation, we solve (19) for increasing values of A, 
obtaining a low-rank approximation to A, (A(A), E(A) = Y — A(A)), which we encode using the universal models 
described in Section IV. We modified the algorithm described in [ ] to allow for warm restarts, using the solution 
for the previous A as a starting point for the next A for faster convergence. 

Consistently with the ii fitting term of (19), we encode the non-zero values of E(A) as a Laplacian sequence of 
unknown parameter. To exploit the potential sparsity in E(A), the locations of the non-zero values are encoded, as 
in IV-A, using an enumerative two-parts code for Bernoulli sequences of unknown parameter. To exploit low-rank 
in the encoding, the matrix A(A) is encoded via its reduced SVD decomposition A(A) = U(A)i;(A)V(A)''". For 
rank(A(A)) = r, we have that U(A) E W^^^ are the left-eigenvectors, S G W^^ is the diagonal matrix whose 
diagonal are the non-zero singular values of A(A), and V(A) G W^^ are the right-eigenvectors of A(A). Each 
column of U is encoded (in this video example) as a smooth image via a causal bilinear predictor identical to the 
one used for predictive coding of D in IV-C, using a Laplacian model for the prediction residuals. Each column 
of V is encoded as a smooth one-dimensional sequence, using a zero order predictor (the predicted value for the 
next coefficient is the previous coefficient value), with a Laplacian prior on the prediction residuals. Finally, the 
values of E, which can be arbitrary, are quantized and encoded using the universal code for integers [ ]. 

The encoding method is very simple, with all unknown parameters encoded using a two-parts code, and code- 
lenghts for the discretized Laplacian pre-computed in look-up tables. Quantization for this case is as follows: 
the codelength associated with the r non-zero singular values is negligible, and we minimize unwanted distortion 
encoding them with high precision (le — 16). As for the columns of U and V, they all have unit norm, so that 
the average magnitude of their elements are close to \/l/m and \/l/n respectively. Based on this, our algorithm 
encodes the data with Su = Q/^/rn as the precision for encoding U, and Sy = Qj \fm for V, for several values of 
Q in (0, 1), keeping the one producing the smallest codelength. The MDL-based estimation algorithm then chooses 
the model for which the codelength L(Y; A) = L(U(A)) + i^(S(A)) + i^(V(A)) is minimized. 

As in [ ], here we show results for two sequences taken from [46]: "Lobby" (Figure 8(a)), and "ShoppingMall" 
(Figure 8(b)). Full videos can be viewed at http://www.tc.umn.edu/~nacho/lowrank/. 

In both cases, the recovered backgrounds are very accurate. In particular, for the Lobby sequence, the selected 
model captures just the eigenvectors needed to recover the background along with its lighting changes, including 
corrections for local shadows, leaving out only the people passing by. 

VIII. Concluding remarks 

We have presented an MDL-based sparse modeling framework, which automatically adapts to the inherent 
complexity of the data at hand using codelength as a metric. 
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(a) Results for "Lobby" sequence, featuring a room with lights that 
are switched off and on. The rank of the approximation for this 
case is rank =10. The moment where the lights are turned off is 
clearly seen here as the "square pulse" in the middle of the first 
two right-eigenvectors (bottom-right). Also note how U2 (top-right) 
compensates for changes in shadows. 

Fig. 8: Low-rank approximation results. Both figures show the first two left-eigenvectors as 2D images at the top, 

two sample frames from the approximation error sequences in the middle, which should contain the people that were 

removed from the videos, and the curve L(A) and the right-eigenvalues, scaled by H (representing the "activity" 

of each left-eigenvector along time), at the bottom. 



time (frame number) 700 

(b) Results for "ShoppingMall", a fixed camera looking at a crowded 
hall. In this case, the rank of the approximation decomposition is 
rank = 7. Here, the first left-eigenvector models the background, 
whereas the rest tend to capture people that stood still for a while. 
Here we see the "phantom" of two such persons in the second left- 
eigenvector (top-right). 



The framework features a sparse coding algorithm and automatic tuning of the sparsity level on a per-sample basis, 
including a sequential collaborative variant which adapts the model parameters as it processes new samples, and 
two dictionary learning variants which learn the size of the dictionaries from the data. In all cases, the information- 
theoretic formulation led to robust coding and learning formulations, including novel robust metrics for the fitting 
term (LG and MOEG), and robust ^i-based dictionary regularization term. This formulation also allowed us to easily 
incorporate more prior information into the coding/learning process, such as Markovian spatial dependencies, by 
means of simple modifications to the probability models used. 

As a result, the framework can be applied out-of-the-box to very different applications, from image denoising to 
low-rank matrix approximation, obtaining competitive results in all the cases presented, with minimal interaction 
from the user. 
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